Abstract

Background: Protein domain ranking is a fundamental task in structural biology. Most protein domain ranking 
methods rely on the pairwise comparison of protein domains while neglecting the global manifold structure of the 
protein domain database. Recently, graph regularized ranking that exploits the global structure of the graph defined 
by the pairwise similarities has been proposed. However, the existing graph regularized ranking methods are very 
sensitive to the choice of the graph model and parameters, and this remains a difficult problem for most of the 
protein domain ranking methods. 

Results: To tackle this problem, we have developed the Multiple Graph regularized Ranking algorithm, MultiG-Rank. 
Instead of using a single graph to regularize the ranking scores, MultiG-Rank approximates the intrinsic manifold of 
protein domain distribution by combining multiple initial graphs for the regularization. Graph weights are learned 
with ranking scores jointly and automatically, by alternately minimizing an objective function in an iterative algorithm. 
Experimental results on a subset of the ASTRAL SCOP protein domain database demonstrate that MultiG-Rank 
achieves a better ranking performance than single graph regularized ranking methods and pairwise similarity based 
ranking methods. 

Conclusion: The problem of graph model and parameter selection in graph regularized protein domain ranking can 
be solved effectively by combining multiple graphs. This aspect of generalization introduces a new frontier in 
applying multiple graphs to solving protein domain ranking applications. 



Background 

Proteins contain one or more domains each of which 
could have evolved independently from the rest of the 
protein structure and which could have unique functions 
[1,2]. Because of molecular evolution, proteins with sim- 
ilar sequences often share similar folds and structures. 
Retrieving and ranking protein domains that are similar to 
a query protein domain from a protein domain database 
are critical tasks for the analysis of protein structure, func- 
tion, and evolution [3-5]. The similar protein domains that 
are classified by a ranking system may help researchers 
infer the functional properties of a query domain from the 
functions of the returned protein domains.

The output of a ranking procedure is usually a list of
database protein domains that are ranked in descending 
order according to a measure of their similarity to the 
query domain. The choice of a similarity measure largely 
defines the performance of a ranking system as argued 
previously [6]. A large number of algorithms for computing
similarity as a ranking score have been developed:

Pairwise protein domain comparison algorithms compute
the similarity between a pair of protein domains
either by protein domain structure alignment or by
comparing protein domain features. Protein structure
alignment based methods compare protein domain structures
at the level of residues and sometimes even atoms,
to detect structural similarities with high sensitivity
and accuracy. For example, Carpentier et al. proposed
YAKUSA [7] which compares protein structures using
one-dimensional characterizations based on protein
backbone internal angles, while Jung and Lee proposed
SHEBA [8] for structural database scanning based on
environmental profiles. Protein domain feature based
methods extract structural features from protein domains 
and compute their similarity using a similarity or dis- 
tance function. For example, Zhang et al. used the 32-D 
tableau feature vector in a comparison procedure called 
IR tableau [3], while Lee and Lee introduced a measure 
called WDAC (Weighted Domain Architecture Com- 
parison) that is used in the protein domain comparison 
context [9]. Both these methods use cosine similarity for 
comparison purposes. 

Graph-based similarity learning algorithms address a
limitation of the traditional protein domain comparison
methods mentioned above, which focus on detecting pairwise
sequence alignments while neglecting all other protein domains
in the database and their distributions. To tackle this
problem, a graph-based transductive similarity learning 
algorithm has been proposed [6,10]. Instead of comput- 
ing pairwise similarities for protein domains, graph-based 
methods take advantage of the graph formed by the exist- 
ing protein domains. By propagating similarity measures 
between the query protein domain and the database pro- 
tein domains via graph transduction (GT), a better metric 
for ranking database protein domains can be learned. 

The main component of graph-based ranking is the 
construction of a graph as the estimation of intrinsic man- 
ifold of the database. As argued by Cai et al. [11], there 
are many ways to define different graphs with different 
models and parameters. However, up to now, there are, 
in general, no explicit rules for choice of graph models 
and parameters. In [6], the graph parameters were deter- 
mined by a grid-search of different pairs of parameters. 
In [11], several graph models were considered for graph 
regularization, and exhaustive experiments were carried 
out for the selection of a graph model and its parame- 
ters. However, these kinds of grid-search strategies select 
parameters from discrete values in the parameter space, 
and thus lack the ability to approximate an optimal solu- 
tion. At the same time, cross-validation [12,13] can be 
used for parameter selection, but it does not always scale 
up very well for many of the graph parameters, and some- 
times it might over-fit the training and validation set while 
not generalizing well on the query set. 

In [14], Geng et al. proposed an ensemble mani- 
fold regularization (EMR) framework that combines the 
automatic intrinsic manifold approximation and semi- 
supervised learning (SSL) [15,16] of a support vector 
machine (SVM) [17,18]. Based on the EMR idea, we 
attempted to solve the problem of graph model and 
parameter selection by fusing multiple graphs to obtain 
a ranking score learning framework for protein domain 
ranking. We first outlined the graph regularized rank- 
ing score learning framework by optimizing ranking score 
learning with both relevance and graph constraints, and
then generalized it to the multiple graph case. First a pool
of initial guesses of the graph Laplacian with different 
graph models and parameters is computed, and then 
they are combined linearly to approximate the intrin- 
sic manifold. The optimal graph model(s) with optimal 
parameters is selected by assigning larger weights to them. 
Meanwhile, ranking score learning is also restricted to 
be smooth along the estimated graph. Because the graph 
weights and ranking scores are learned jointly, a unified 
objective function is obtained. The objective function is 
optimized alternately and conditionally with respect to 
multiple graph weights and ranking scores in an iterative 
algorithm. We have named our Multiple Graph regular- 
ized Ranking method MultiG-Rank. It is composed of an 
off-line graph weights learning algorithm and an on-line 
ranking algorithm. 

Methods 

Graph model and parameter selection

We are given a data set of protein domains represented by
their tableau 32-D feature vectors [3],
$\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$, where $x_i \in \mathbb{R}^{32}$
is the tableau feature vector of the $i$-th protein domain, $x_q$
is the query protein domain, and the others are database
protein domains. We define the ranking score vector as
$f = [f_1, f_2, \ldots, f_N]^\top \in \mathbb{R}^{N}$, in which $f_i$ is the
ranking score of $x_i$ to the query domain. The problem is to rank
the protein domains in $\mathcal{X}$ in descending order according to
their ranking scores and return several of the top ranked domains as
the ranking results so that the returned protein domains
are as relevant to the query as possible. Here we define two
types of protein domains: relevant when they belong to
the same SCOP fold type [19], and irrelevant when they do
not. We denote the SCOP-fold labels of protein domains
in $\mathcal{X}$ as $\mathcal{L} = \{l_1, l_2, \ldots, l_N\}$, where $l_i$ is the label
of the $i$-th protein domain and $l_q$ is the query label. The optimal
ranking scores of relevant protein domains (those with $l_i = l_q$)
should be larger than those of the irrelevant ones (those with
$l_i \neq l_q$), so that the
relevant protein domains will be returned to the user.

Graph regularized protein domain ranking 

We applied two constraints on the optimal ranking score
vector $f$ to learn the optimal ranking scores:

Relevance constraint: Because the query protein
domain reflects the search intention of the user, $f$ should
be consistent with protein domains that are relevant to
the query. We also define a relevance vector of the protein
domains as $y = [y_1, y_2, \ldots, y_N]^\top \in \{1, 0\}^{N}$, where
$y_i = 1$ if $x_i$ is relevant to the query and $y_i = 0$ if it is not.
Because the type label $l_q$ of a query protein domain $x_q$ is
usually unknown, we know only that the query is relevant to itself
and have no prior knowledge of whether or not others are
relevant; therefore, we can only set $y_q = 1$ while $y_i$, $i \neq q$,
is unknown.

To assign different weights to different protein domains
in $\mathcal{X}$, we define a diagonal matrix $U$ as $U_{ii} = 1$ when
$y_i$ is known, otherwise $U_{ii} = 0$. To impose the relevance
constraint to the learning of $f$, we aim to minimize the
following objective function:

$$\min_{f} O^{r}(f) = \sum_{i=1}^{N} (f_i - y_i)^2 U_{ii} = (f - y)^\top U (f - y) \tag{1}$$

Graph constraint: $f$ should also be consistent with the
local distribution found in the protein domain database.
The local distribution was embedded into a $K$ nearest
neighbor graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, W\}$. For each protein
domain $x_i$, its $K$ nearest neighbors, excluding itself, are denoted
by $\mathcal{N}_i$. The node set $\mathcal{V}$ corresponds to the $N$ protein
domains in $\mathcal{X}$, while $\mathcal{E}$ is the edge set, and
$(i, j) \in \mathcal{E}$ if $x_j \in \mathcal{N}_i$ or $x_i \in \mathcal{N}_j$.
The weight of an edge $(i, j)$ is denoted as $W_{ij}$, which can be
computed using different graph definitions and parameters as
described in the next section.
The edge weights are further organized in a weight matrix
$W = [W_{ij}] \in \mathbb{R}^{N \times N}$, where $W_{ij}$ is the weight of edge $(i, j)$.
We expect that if two protein domains $x_i$ and $x_j$ are close
(i.e., $W_{ij}$ is big), then $f_i$ and $f_j$ should also be close. To
impose the graph constraint to the learning of $f$, we aim to
minimize the following objective function:

$$\min_{f} O^{g}(f) = \frac{1}{2} \sum_{i,j=1}^{N} (f_i - f_j)^2 W_{ij} = f^\top D f - f^\top W f = f^\top L f \tag{2}$$

where $D$ is a diagonal matrix whose entries are
$D_{ii} = \sum_{j=1}^{N} W_{ij}$ and $L = D - W$ is the graph Laplacian matrix.
This is a basic identity in spectral graph theory and it provides
some insight into the remarkable properties of the
graph Laplacian.

When the two constraints are combined, the learning of
$f$ is based on the minimization of the following objective
function:

$$\min_{f} O(f) = O^{r}(f) + \alpha O^{g}(f) = (f - y)^\top U (f - y) + \alpha f^\top L f \tag{3}$$

where $\alpha$ is a trade-off parameter of the smoothness
penalty. The solution is obtained by setting the derivative
of $O(f)$ with respect to $f$ to zero as $f = (U + \alpha L)^{-1} U y$. In
this way, information from both the query protein domain
provided by the user and the relationship of all the protein
domains in $\mathcal{X}$ are used to rank the protein domains in
$\mathcal{X}$. The query information is embedded in $y$ and $U$, while
the protein domain relationship information is embedded
in $L$. The final ranking results are obtained by balancing
the two sources of information. In this paper, we call this
method Graph regularized Ranking (G-Rank).

Multiple graph learning and ranking: MultiG-Rank

Here we describe the multiple graph learning method to
directly learn a self-adaptive graph for ranking regularization.
The graph is assumed to be a linear combination of
multiple predefined graphs (referred to as base graphs).
The graph weights are learned in a supervised way by considering
the SCOP fold types of the protein domains in the
database.

Multiple graph regularization

The main component of graph regularization is the construction
of a graph. As described previously, there are
many ways to find the neighbors $\mathcal{N}_i$ of $x_i$ and to define the
weight matrix $W$ on the graph [11]. Several of them are as
follows:

• Gaussian kernel weighted graph: the neighbor set $\mathcal{N}_i$ of $x_i$
is found by comparing the squared Euclidean distance as,

$$\|x_i - x_j\|^2 = x_i^\top x_i - 2 x_i^\top x_j + x_j^\top x_j \tag{4}$$

and the weighting is computed using a Gaussian
kernel as,

$$W_{ij} = \begin{cases} e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}, & \text{if } (i, j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{5}$$

where $\sigma$ is the bandwidth of the kernel.

• Dot-product weighted graph: the neighbor set $\mathcal{N}_i$ of $x_i$ is
found by comparing the squared Euclidean distance and the
weighting is computed as the dot-product as,

$$W_{ij} = \begin{cases} x_i^\top x_j, & \text{if } (i, j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{6}$$

• Cosine similarity weighted graph: the neighbor set $\mathcal{N}_i$ of $x_i$
is found by comparing cosine similarity as,

$$C(x_i, x_j) = \frac{x_i^\top x_j}{\|x_i\| \, \|x_j\|} \tag{7}$$

and the weighting is also assigned as cosine similarity
as,

$$W_{ij} = \begin{cases} C(x_i, x_j), & \text{if } (i, j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{8}$$

• Jaccard index weighted graph: the neighbor set $\mathcal{N}_i$ of $x_i$ is
found by comparing the Jaccard index [20] as,

$$J(x_i, x_j) = \frac{|x_i \cap x_j|}{|x_i \cup x_j|} \tag{9}$$

and the weighting is assigned as,

$$W_{ij} = \begin{cases} J(x_i, x_j), & \text{if } (i, j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{10}$$

• Tanimoto coefficient weighted graph: the neighbor set $\mathcal{N}_i$ of
$x_i$ is found by comparing the Tanimoto coefficient as,

$$T(x_i, x_j) = \frac{x_i^\top x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i^\top x_j} \tag{11}$$

and the weighting is assigned as,

$$W_{ij} = \begin{cases} T(x_i, x_j), & \text{if } (i, j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{12}$$
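The graph definitions above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pairwise_similarity`, `knn_graph`, `laplacian`), the random toy data, and the values K = 3 and sigma = 1 are all made up for the example, and the dot-product and Jaccard graphs are left out for brevity. The last lines check the identity in (2) numerically.

```python
import numpy as np

def pairwise_similarity(X, kind, sigma=1.0):
    """Dense N x N similarity matrix for one graph definition."""
    G = X @ X.T                                     # dot products x_i^T x_j
    sq = np.diag(G)                                 # squared norms ||x_i||^2
    if kind == "gaussian":
        d2 = np.maximum(sq[:, None] - 2.0 * G + sq[None, :], 0.0)  # eq. (4)
        return np.exp(-d2 / (2.0 * sigma ** 2))     # eq. (5)
    if kind == "cosine":
        norms = np.sqrt(sq)
        return G / np.outer(norms, norms)           # eq. (7)
    if kind == "tanimoto":
        return G / (sq[:, None] + sq[None, :] - G)  # eq. (11)
    raise ValueError(f"unknown graph type: {kind}")

def knn_graph(S, K):
    """Keep S[i, j] only where x_j is a K nearest neighbour of x_i or vice versa."""
    N = S.shape[0]
    S_off = S.copy()
    np.fill_diagonal(S_off, -np.inf)                # a domain is not its own neighbour
    nbrs = np.argsort(-S_off, axis=1)[:, :K]        # indices of the K most similar
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), K), nbrs.ravel()] = True
    mask = mask | mask.T                            # edge if either end lists the other
    return np.where(mask, S, 0.0)

def laplacian(W):
    """Graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
X = rng.random((8, 32))                             # made-up 32-D feature vectors
f = rng.random(8)                                   # arbitrary score vector
for kind in ("gaussian", "cosine", "tanimoto"):
    W = knn_graph(pairwise_similarity(X, kind), K=3)
    L = laplacian(W)
    pairwise_sum = 0.5 * np.sum((f[:, None] - f[None, :]) ** 2 * W)
    print(kind, np.isclose(pairwise_sum, f @ L @ f))  # identity of eq. (2)
```

Selecting neighbours by the same similarity that is used for the weights matches the Gaussian, cosine and Tanimoto bullets; for the Gaussian kernel this is equivalent to ranking by squared Euclidean distance because the kernel is monotone in that distance.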



With so many possible choices of graphs, the most suit- 
able graph with its parameters for the protein domain 
ranking task is often not known in advance; thus, an 
exhaustive search on a predefined pool of graphs is nec- 
essary. When the size of the pool becomes large, an 
exhaustive search will be quite time-consuming and some- 
times not possible. Hence, a method for efficiently learn- 
ing an appropriate graph to make the performance of the 
employed graph-based ranking method robust or even 
improved is crucial for graph regularized ranking. To 
tackle this problem we propose a multiple graph regular- 
ized ranking framework, that provides a series of initial 
guesses of the graph Laplacian and combines them to 
approximate the intrinsic manifold in a conditionally opti- 
mal way, inspired by a previously reported method [14]. 

Given a set of $M$ graph candidates $\{\mathcal{G}_1, \ldots, \mathcal{G}_M\}$, we
denote their corresponding candidate graph Laplacians as
$\mathcal{T} = \{L_1, \ldots, L_M\}$. By assuming that the optimal graph
Laplacian lies in the convex hull of the pre-given graph
Laplacian candidates, we constrain the search space of
possible graph Laplacians to linear combinations of the $L_m$ in
$\mathcal{T}$ as,

$$L = \sum_{m=1}^{M} \mu_m L_m \tag{13}$$

where $\mu_m$ is the weight of the $m$-th graph. To avoid any
negative contribution, we further constrain
$\sum_{m=1}^{M} \mu_m = 1$, $\mu_m \geq 0$.

To use the information from data distribution approximated
by the new composite graph Laplacian $L$ in (13)
for protein domain ranking, we introduce a new multi-graph
regularization term. By substituting (13) into (2), we
get the augmented objective function term in an enlarged
parameter space as,

$$\min_{f, \mu} O^{\mathrm{multi}}(f, \mu) = \sum_{m=1}^{M} \mu_m \left( f^\top L_m f \right) \quad \text{s.t. } \sum_{m=1}^{M} \mu_m = 1, \ \mu_m \geq 0. \tag{14}$$

where $\mu = [\mu_1, \ldots, \mu_M]^\top$ is the graph weight vector.
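As a small numerical illustration of (13) and (14), the sketch below combines two toy Laplacians and checks that the weighted sum of the per-graph penalties equals the penalty under the combined Laplacian. The helper names `combine_laplacians` and `multi_graph_penalty`, the two 3-node weight matrices, and the weights 0.25 and 0.75 are hypothetical choices for the example.

```python
import numpy as np

def combine_laplacians(laplacians, mu):
    """Convex combination L = sum_m mu_m L_m, as in eq. (13)."""
    mu = np.asarray(mu, dtype=float)
    if np.any(mu < 0) or not np.isclose(mu.sum(), 1.0):
        raise ValueError("mu must be non-negative and sum to 1")
    return np.tensordot(mu, np.stack(laplacians), axes=1)

def multi_graph_penalty(f, laplacians, mu):
    """Regularizer of eq. (14): sum_m mu_m * f^T L_m f."""
    return sum(m * (f @ Lm @ f) for m, Lm in zip(mu, laplacians))

# Two toy 3-node graphs (made-up weights), turned into Laplacians.
W1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
W2 = np.array([[0, 0, 2], [0, 0, 1], [2, 1, 0]], dtype=float)
Ls = [np.diag(W.sum(axis=1)) - W for W in (W1, W2)]
mu = [0.25, 0.75]
L = combine_laplacians(Ls, mu)
f = np.array([1.0, 0.0, -1.0])
penalty = multi_graph_penalty(f, Ls, mu)
print(penalty, f @ L @ f)   # both expressions give the same value
```

Because the penalty is linear in $\mu$, moving weight toward a graph on which $f$ is smoother always lowers it, which is why the simplex constraint is needed to keep the problem bounded.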

Off-line supervised multiple graph learning 

In the on-line querying procedure, the relevance of query
$x_q$ to database protein domains is unknown and thus the
optimal graph weights $\mu$ cannot be learned in a supervised
way. However, all the SCOP-fold labels of protein domains
in the database are known, making the supervised learning
of $\mu$ in an off-line way possible. We treat each database
protein domain $x_q \in \mathcal{D}$, $q = 1, \ldots, N$, as a query in the
off-line learning and all the items of its relevance vector
$y_q = [y_{1q}, \ldots, y_{Nq}]^\top$ as known because all the SCOP-fold
labels are known for all the database protein domains as,

$$y_{iq} = \begin{cases} 1, & \text{if } l_i = l_q \\ 0, & \text{else} \end{cases} \tag{15}$$

Therefore, we set $U = I_{N \times N}$ as an $N \times N$ identity matrix.
The ranking score vector of the $q$-th database protein
domain is also defined as $f_q = [f_{1q}, \ldots, f_{Nq}]^\top$. Substituting
$f_q$, $y_q$ and $U$ into (1) and (14) and combining them,
we have the optimization problem for the $q$-th database
protein domain as,

$$\min_{f_q, \mu} O(f_q, \mu) = (f_q - y_q)^\top (f_q - y_q) + \alpha \sum_{m=1}^{M} \mu_m \left( f_q^\top L_m f_q \right) + \beta \|\mu\|^2 \quad \text{s.t. } \sum_{m=1}^{M} \mu_m = 1, \ \mu_m \geq 0. \tag{16}$$

To avoid the parameter $\mu$ over-fitting to one single
graph, we also introduce the $\ell_2$ norm regularization term
$\|\mu\|^2$ to the objective function. The difference between $f_q$
and $y_q$ should be noted: $y_q \in \{1, 0\}^{N}$ plays the role of
the given ground truth in the supervised learning procedure,
while $f_q \in \mathbb{R}^{N}$ is the variable to be solved. While $y_q$
is the ideal solution of $f_q$, it is not always achieved after
the learning. Thus, we introduce the first term in (16) to
make $f_q$ as similar to $y_q$ as possible during the learning
procedure.

Objective function: Using all protein domains in the
database, $q = 1, \ldots, N$, as queries to learn $\mu$, we obtain
the final objective function of supervised multiple graph
weighting and protein domain ranking as,

$$\begin{aligned}
\min_{F, \mu} O(F, \mu) &= \sum_{q=1}^{N} \left[ (f_q - y_q)^\top (f_q - y_q) + \alpha \sum_{m=1}^{M} \mu_m \left( f_q^\top L_m f_q \right) \right] + \beta \|\mu\|^2 \\
&= \mathrm{Tr}\left[ (F - Y)^\top (F - Y) \right] + \alpha \sum_{m=1}^{M} \mu_m \mathrm{Tr}\left( F^\top L_m F \right) + \beta \|\mu\|^2 \\
&\text{s.t. } \sum_{m=1}^{M} \mu_m = 1, \ \mu_m \geq 0.
\end{aligned} \tag{17}$$

where $F = [f_1, \ldots, f_N]$ is the ranking score matrix with the
$q$-th column as the ranking score vector of the $q$-th protein
domain, and $Y = [y_1, \ldots, y_N]$ is the relevance matrix with
the $q$-th column as the relevance vector of the $q$-th protein
domain.



Optimization: Because direct optimization of (17) is difficult,
we instead adopt an iterative, two-step strategy to
alternately optimize $F$ and $\mu$. At each iteration, either $F$ or
$\mu$ is optimized while the other is fixed, and then the roles
are switched. Iterations are repeated until a maximum
number of iterations is reached.

• Optimizing $F$: By fixing $\mu$, the analytic solution for
(17) can be easily obtained by setting the derivative of
$O(F, \mu)$ with respect to $F$ to zero. That is,

$$\begin{aligned}
\frac{\partial O(F, \mu)}{\partial F} &= 2 (F - Y) + 2 \alpha \sum_{m=1}^{M} \mu_m (L_m F) = 0 \\
F &= \left( I + \alpha \sum_{m=1}^{M} \mu_m L_m \right)^{-1} Y
\end{aligned} \tag{18}$$

• Optimizing $\mu$: By fixing $F$ and removing items
irrelevant to $\mu$ from (17), the optimization problem
(17) is reduced to,

$$\begin{aligned}
\min_{\mu} \ & \alpha \sum_{m=1}^{M} \mu_m \mathrm{Tr}\left( F^\top L_m F \right) + \beta \|\mu\|^2 = \alpha \sum_{m=1}^{M} \mu_m e_m + \beta \sum_{m=1}^{M} \mu_m^2 \\
\text{s.t. } & \sum_{m=1}^{M} \mu_m = 1, \ \mu_m \geq 0.
\end{aligned} \tag{19}$$

where $e_m = \mathrm{Tr}(F^\top L_m F)$ and $e = [e_1, \ldots, e_M]^\top$. The
optimization of (19) with respect to the graph weight
$\mu$ can then be solved as a standard quadratic
programming (QP) problem [4].
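As a side note on (19), the objective is separable and, for $\beta > 0$, strictly convex, so the QP can be characterized up to a single scalar. The derivation below is a sketch based on standard KKT conditions; the multipliers $\lambda$ and $\nu_m$ are auxiliary symbols introduced here and are not part of the method's notation.

```latex
\begin{aligned}
\alpha \sum_{m=1}^{M} \mu_m e_m + \beta \sum_{m=1}^{M} \mu_m^2
  &= \beta \left\| \mu + \tfrac{\alpha}{2\beta} e \right\|^2
     - \tfrac{\alpha^2}{4\beta} \| e \|^2 , \\
\text{KKT:} \quad
\alpha e_m + 2 \beta \mu_m - \lambda - \nu_m &= 0, \qquad
\nu_m \geq 0, \qquad \nu_m \mu_m = 0, \\
\Rightarrow \quad
\mu_m &= \max\!\left( 0, \ \frac{\lambda - \alpha e_m}{2\beta} \right),
\qquad \lambda \text{ chosen so that } \sum_{m=1}^{M} \mu_m = 1 .
\end{aligned}
```

Read this way, the $\mu$-step is the Euclidean projection of $-\frac{\alpha}{2\beta} e$ onto the probability simplex: graphs with a small smoothness cost $e_m$ receive larger weights, and a larger $\beta$ spreads the weights more evenly across graphs.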

Off-line algorithm: The off-line $\mu$ learning algorithm is
summarized as Algorithm 1.

Algorithm 1. MultiG-Rank: off-line graph weights
learning algorithm.

Require: Candidate graph Laplacians set $\mathcal{T}$;
Require: SCOP type label set of database protein
domains $\mathcal{L}$;
Require: Maximum iteration number $T$;

Construct the relevance matrix $Y = [y_{iq}]_{N \times N}$ where
$y_{iq} = 1$ if $l_i = l_q$, 0 otherwise;
Initialize the graph weights as $\mu_m^{0} = \frac{1}{M}$,
$m = 1, \ldots, M$;
for $t = 1, \ldots, T$ do
    Update the ranking score matrix $F^{t}$ according to
    previous $\mu^{t-1}$ by (18);
    Update the graph weight $\mu^{t}$ according to
    updated $F^{t}$ by (19);
end for
Output graph weight $\mu = \mu^{T}$.
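The alternating scheme of Algorithm 1 can be sketched as follows. This is an illustrative sketch rather than the authors' code: instead of calling a general QP solver for (19), it uses the fact that for $\beta > 0$ the problem is equivalent to projecting $-\alpha e / (2\beta)$ onto the probability simplex, computed with a standard sort-based projection. The names `project_simplex` and `learn_graph_weights`, the default values of `alpha`, `beta` and `n_iter`, and the six-node toy graphs are assumptions made for the example.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {mu : mu >= 0, sum(mu) = 1}."""
    u = np.sort(v)[::-1]                          # entries in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]      # last index kept positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def learn_graph_weights(laplacians, labels, alpha=1.0, beta=1.0, n_iter=10):
    """Alternate the F-step (18) and the mu-step (19) for n_iter rounds."""
    labels = np.asarray(labels)
    N, M = labels.size, len(laplacians)
    Y = (labels[:, None] == labels[None, :]).astype(float)  # y_iq = 1 if l_i == l_q
    mu = np.full(M, 1.0 / M)                                 # uniform initial weights
    for _ in range(n_iter):
        L = sum(m * Lm for m, Lm in zip(mu, laplacians))     # eq. (13)
        F = np.linalg.solve(np.eye(N) + alpha * L, Y)        # eq. (18)
        e = np.array([np.trace(F.T @ Lm @ F) for Lm in laplacians])
        mu = project_simplex(-alpha * e / (2.0 * beta))      # minimizer of eq. (19)
    return mu, F

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

# Toy data: six domains in two folds, one graph that links domains within
# a fold and one that links domains across folds (both made up).
labels = np.array([0, 0, 0, 1, 1, 1])
W_good = (labels[:, None] == labels[None, :]).astype(float) - np.eye(6)
W_bad = np.zeros((6, 6))
for i in range(3):
    W_bad[i, i + 3] = W_bad[i + 3, i] = 1.0
mu, F = learn_graph_weights([laplacian(W_good), laplacian(W_bad)], labels)
print(mu)   # the within-fold graph receives the larger weight
```

On this toy problem the relevance vectors are constant within each fold, so the within-fold graph has zero smoothness cost while the across-fold graph does not, and the learned weights concentrate on the former.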

On-line ranking regularized by multiple graphs 

Given a newly discovered protein domain submitted
by a user as query $x_0$, its SCOP type label $l_0$ will be
unknown and the domain will not be in the database
$\mathcal{D} = \{x_1, \ldots, x_N\}$. To compute the ranking scores of
$x_i \in \mathcal{D}$ to query $x_0$, we extend the size of the database to
$N + 1$ by adding $x_0$ into the database and then solve the ranking
score vector for $x_0$, which is defined as
$f = [f_0, \ldots, f_N]^\top \in \mathbb{R}^{N+1}$, using
(3). The parameters in (3) are constructed as follows:

• Laplacian matrix $L$: We first compute the $M$ graph
weight matrices $\{W_m\}_{m=1}^{M} \in \mathbb{R}^{(N+1) \times (N+1)}$ with their
corresponding Laplacian matrices
$\{L_m\}_{m=1}^{M} \in \mathbb{R}^{(N+1) \times (N+1)}$ for the extended database
$\{x_0, x_1, \ldots, x_N\}$. Then with the graph weight $\mu$
learned by Algorithm 1, the new Laplacian matrix $L$
can be computed as in (13).
On-line graph weight computation: When a new
query $x_0$ is added to the database, we calculate its $K$
nearest neighbors in the database $\mathcal{D}$ and the
corresponding weights $W_{0j}$ and $W_{j0}$, $j = 1, \ldots, N$. If
adding this new query to the database does not affect
the graph in the database space, the neighbors and
weights $W_{ij}$, $i, j = 1, \ldots, N$, for the protein domains
in the database are fixed and can be pre-computed
off-line. Thus, we only need to compute $N$ edge
weights for each graph instead of $(N + 1) \times (N + 1)$.

• Relevance vector $y$: The relevance vector for $x_0$ is
defined as $y = [y_0, \ldots, y_N]^\top \in \{1, 0\}^{N+1}$ with only
$y_0 = 1$ known and $y_i$, $i = 1, \ldots, N$, unknown.

• Matrix $U$: In this situation, $U$ is an $(N + 1) \times (N + 1)$
diagonal matrix with $U_{00} = 1$ and $U_{ii} = 0$,
$i = 1, \ldots, N$.

Then the ranking score vector $f$ can be solved as,

$$f = (U + \alpha L)^{-1} U y \tag{20}$$

The on-line ranking algorithm is summarized as Algo- 
rithm 2. 

Algorithm 2. MultiG-Rank: on-line ranking algo- 
rithm. 

Require: Protein domain database $\mathcal{D} = \{x_1, \ldots, x_N\}$;
Require: Query protein domain $x_0$;
Require: Graph weight $\mu$;

Extend the database to $(N + 1)$ size by adding $x_0$
and compute $M$ graph Laplacians of the extended
database;

Obtain multiple graph Laplacian $L$ by linear
combination of $M$ graph Laplacians with weight $\mu$ as
in (13);

Construct the relevance vector $y \in \mathbb{R}^{(N+1)}$ where
$y_0 = 1$ and diagonal matrix $U \in \mathbb{R}^{(N+1) \times (N+1)}$ with
$U_{ii} = 1$ if $i = 0$ and 0 otherwise;

Solve the ranking vector $f$ for $x_0$ as in (20);

Rank protein domains in $\mathcal{D}$ according to ranking
scores $f$ in descending order.
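The closed form in (20) amounts to one linear solve followed by a sort. The sketch below is a hedged illustration: `graph_regularized_scores` is a hypothetical helper name, the four-node chain graph stands in for the combined Laplacian of (13), and the toy usage sets $U = I$ (unknown relevance entries treated as 0, as in the off-line step (18)), which is an assumption of this example rather than the setting of Algorithm 2, where only $U_{00}$ is non-zero.

```python
import numpy as np

def graph_regularized_scores(L, y, U, alpha=1.0):
    """Solve (U + alpha L) f = U y, i.e. the closed form f = (U + alpha L)^{-1} U y."""
    return np.linalg.solve(U + alpha * L, U @ y)

# Toy extended database: query x_0 plus three domains on a chain 0-1-2-3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W         # stands in for the combined Laplacian
y = np.array([1.0, 0.0, 0.0, 0.0])     # only the query is marked relevant
U = np.eye(4)                          # assumption: unknown entries treated as 0
f = graph_regularized_scores(L, y, U, alpha=1.0)
ranking = np.argsort(-f[1:]) + 1       # database domains, best first
print(f, ranking)
```

In this toy run the scores decay with graph distance from the query, so the domain adjacent to the query is ranked first. Using a linear solve instead of forming the inverse explicitly is the usual numerically safer choice.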

Protein domain database and query set 

We used the SCOP 1.75A database [21] to construct the 
database and query set. In the SCOP 1.75A database, 
there are 49,219 protein domain PDB entries and 135,643 
domains, belonging to 7 classes and 1,194 SCOP fold 
types. 

Protein domain database 

Our protein domain database was selected from the ASTRAL
SCOP 1.75A set [21], a subset of the SCOP (Structural
Classification of Proteins) 1.75A database which was
released on March 15, 2012 [21]. ASTRAL SCOP 1.75A
40% [21], a genetic domain sequence subset, was used
as our protein domain database $\mathcal{D}$. This database was
selected from the SCOP 1.75A database so that the selected
domains have less than 40% identity to each other. There
are a total of 11,212 protein domains in the ASTRAL
SCOP 1.75A 40% database belonging to 1,196 SCOP fold
types. The ASTRAL database is available on-line at
http://scop.berkeley.edu. The number of protein domains in
each SCOP fold varies from 1 to 402. The distribution of 
protein domains with the different fold types is shown in 
Figure 1. Many previous studies evaluated ranking per- 
formances using the older version of the ASTRAL SCOP 
dataset (ASTRAL SCOP 1.73 95%) that was released in 
2008 [3]. 

Query set 

We also randomly selected 540 protein domains from the 
SCOP 1.75A database to construct a query set. For each 
query protein domain that we selected we ensured that 
there was at least one protein domain belonging to the 
same SCOP fold type in the ASTRAL SCOP 1.75A 40% 
database, so that for each query, there was at least one 
"positive" sample in the protein domain database. How- 
ever, it should be noted that the 540 protein domains in 
the query data set were randomly selected and do not 
necessarily represent 540 different folds. Here we call our 
query set the 540 query dataset because it contains 540 
protein domains from the SCOP 1.75A database. 

Evaluation metrics 

A ranking procedure is run against the protein domain
database using a query domain. A list of all matching protein
domains along with their ranking scores is returned.
We adopted the same evaluation metric framework as was
described previously [3], and used the receiver operating
characteristic (ROC) curve, the area under the ROC
curve (AUC), and the recall-precision curve to evaluate
the ranking accuracy. Given a query protein domain $x_q$
belonging to the SCOP fold $l_q$, a list of protein domains
is returned from the database by the on-line MultiG-Rank
algorithm or other ranking methods. For a database protein
domain $x_r$ in the returned list, if its fold label $l_r$ is the
same as that of $x_q$, i.e. $l_r = l_q$, it is identified as a true
positive (TP), else it is identified as a false positive (FP). For a
database protein domain $x_{r'}$ not in the returned list, if its
fold label $l_{r'} \neq l_q$, it will be identified as a true negative
(TN), else it is a false negative (FN). The true positive rate
(TPR), false positive rate (FPR), recall, and precision can
then be computed based on the above statistics as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{recall} = \frac{TP}{TP + FN}, \quad \mathrm{precision} = \frac{TP}{TP + FP} \tag{21}$$

By varying the length of the returned list, different TPR,
FPR, recall and precision values are obtained.

Figure 1 Distribution of protein domains with different fold types in the ASTRAL SCOP 1.75A 40% database.

ROC curve: Using FPR as the abscissa and TPR as the
ordinate, the ROC curve can be plotted. For a high-
performance ranking system, the ROC curve should be as
close to the top-left corner as possible.

Recall-precision curve: Using recall as the abscissa and
precision as the ordinate, the recall-precision curve can
be plotted. For a high-performance ranking system, this
curve should be close to the top-right corner of the plot.

AUC: The AUC is computed as a single-figure measurement
of the quality of an ROC curve. AUC is averaged over
all the queries to evaluate the performances of different
ranking methods.
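The metrics above can be computed directly from a score vector and a binary relevance vector. The following is a minimal sketch, assuming at least one relevant and one irrelevant database domain; the function names `ranking_curves` and `auc`, the trapezoidal integration starting at the origin, and the five-item toy ranking are illustrative choices, not details taken from the evaluation protocol of [3].

```python
import numpy as np

def ranking_curves(scores, relevant):
    """TPR, FPR and precision for every returned-list length 1..N."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # best score first
    rel = np.asarray(relevant, dtype=bool)[order]
    tp = np.cumsum(rel)               # relevant domains inside the list
    fp = np.cumsum(~rel)              # irrelevant domains inside the list
    fn = rel.sum() - tp               # relevant domains left out
    tn = (~rel).sum() - fp            # irrelevant domains left out
    tpr = tp / (tp + fn)              # identical to recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision

def auc(tpr, fpr):
    """Area under the ROC curve by the trapezoidal rule, starting at (0, 0)."""
    x = np.concatenate(([0.0], fpr))
    y = np.concatenate(([0.0], tpr))
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# Toy ranking: five database domains, two of them in the query's fold.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
relevant = [1, 0, 1, 0, 0]
tpr, fpr, precision = ranking_curves(scores, relevant)
print(auc(tpr, fpr))
```

For this toy ranking, five of the six (relevant, irrelevant) pairs are ordered correctly, which is what the trapezoidal area measures when there are no tied scores.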



Results and discussion 

We first compared our MultiG-Rank against several pop- 
ular graph-based ranking score learning methods for 
ranking protein domains. We then evaluated the ranking
performance of MultiG-Rank against other protein
domain ranking methods using different protein domain 
comparison strategies. Finally, a case study of a TIM barrel 
fold is described. 



Comparison of MultiG-Rank against other graph-based 
ranking methods 

We compared our MultiG-Rank to two graph-based rank- 
ing methods, G-Rank and GT [6], and against the pairwise 
protein domain comparison based ranking method proposed
in [3] as a baseline method (Figure 2). The evaluations
were conducted with the 540 query domains from
the 540 query set. The average ranking performance was 
computed over these 540 query runs. 

The figure shows the ROC and the recall-precision 
curves obtained using the different graph ranking meth- 
ods. As can be seen, the MultiG-Rank algorithm sig- 
nificantly outperformed the other graph-based ranking 
algorithms; the precision difference got larger as the recall 
value increased and then tended to converge as the precision
tended towards zero (Figure 2 (b)). The G-Rank
algorithm outperformed GT in most cases; however, both 
G-Rank and GT were much better than the pairwise rank- 
ing which neglects the global distribution of the protein 
domain database. 




Figure 2 Comparison of MultiG-Rank against other protein domain ranking methods. Each curve represents a graph-based ranking score
learning algorithm. MultiG-Rank, the Multiple Graph regularized Ranking algorithm; G-Rank, Graph regularized Ranking; GT, graph transduction;
Pairwise Rank, pairwise protein domain ranking method [3]. (a) ROC curves of the different ranking methods (abscissa: FPR); (b) Recall-precision curves of the
different ranking methods (abscissa: Recall).

The AUC results for the different ranking methods on
the 540 query set are tabulated in Table 1. As shown, 
the proposed MultiG-Rank consistently outperformed the 
other three methods on the 540 query set against our pro- 
tein domain database, achieving a gain in AUC of 0.0155, 
0.0210 and 0.0252 compared with G-Rank, GT and Pair- 
wise Rank, respectively. Thus, we have shown that the 
ranking precision can be improved significantly using our 
algorithm. 

We have made three observations from the results listed 
in Table 1: 

1. G-Rank and GT produced similar performances on
our protein domain database, indicating that there is
no significant difference in the performance of the
graph transduction based or graph regularization
based single graph ranking methods for
unsupervised learning of the ranking scores.

2. Pairwise ranking produced the worst performance 
even though the method uses a carefully selected 
similarity function as reported in [3]. One reason for 
the poorer performance is that similarity computed 
by pairwise ranking is focused on detecting 
statistically significant pairwise differences only, 
while more subtle sequence similarities are missed. 
Hence, the variance among different fold types 
cannot be accurately estimated when the global 
distribution is neglected and only the protein domain 
pairs are considered. Another possible reason is that 
pairwise ranking usually produces a better 
performance when there is only a small number of 
protein domains in the database; therefore, because 
our database contains a large number of protein 
domains, the ranking performance of the pairwise 
ranking method was poor. 

3. MultiG-Rank produced the best ranking 
performance, implying that both the discriminant 
and geometrical information in the protein domain 
database are important for accurate ranking. In 
MultiG-Rank, the geometrical information is 
estimated by multiple graphs and the discriminant 
information is included by using the SCOP-fold type 
labels to learn the graph weights. 



Table 1 AUC results of different graph-based ranking methods

Method           AUC
MultiG-Rank      0.9730
G-Rank           0.9575
GT               0.9520
Pairwise-Rank    0.9478
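The AUC gains quoted in the text follow from Table 1 by simple subtraction. The snippet below is only an arithmetic check on the tabulated values; the dictionary name `auc_table` is a label chosen for the example.

```python
# AUC values copied from Table 1.
auc_table = {
    "MultiG-Rank": 0.9730,
    "G-Rank": 0.9575,
    "GT": 0.9520,
    "Pairwise-Rank": 0.9478,
}
best = auc_table["MultiG-Rank"]
# Gain of MultiG-Rank over each other method, rounded to four decimals.
gains = {
    name: round(best - value, 4)
    for name, value in auc_table.items()
    if name != "MultiG-Rank"
}
print(gains)
```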



Comparison of MultiG-Rank with other protein domain 
ranking methods 

We compared MultiG-Rank against several other popular
protein domain ranking methods: IR Tableau [3], QP
Tableau [4], YAKUSA [7], and SHEBA [8]. For the query
domains and the protein domain database we used the 540
query set and the ASTRAL SCOP 1.75A 40% database,
respectively. The YAKUSA software source code was
downloaded from http://wwwabi.snv.jussieu.fr/YAKUSA,
compiled, and used for ranking. We used the "make
Bank" shell script (http://wwwabi.snv.jussieu.fr/YAKUSA),
which calls the phipsi program (Version 0.99 ABI, June
1993), to format the database. YAKUSA compares a query
domain to a database and returns a list of the protein
domains along with ranks and ranking scores. We
used the default parameters of YAKUSA to rank the
protein domains in our database. The SHEBA software
(version 3.11) source code was downloaded from
https://ccrod.cancer.gov/confluence/display/CCRLEE/SHEBA,
compiled, and used for ranking. The
protein domain database was converted to "env" format
and the pairwise alignment was performed between each
query domain and each database domain to obtain the
alignment scores. For each method, we first computed the
protein domain-protein domain similarity or dissimilarity
between each query and each database domain. The database
domains were then ordered by these pairwise similarities
to detect hits. For MultiG-Rank, the
ranking score was used as the measure of protein
domain-protein domain similarity. The ranking results were
evaluated based on the ROC and recall-precision curves
shown in Figure 3. The AUC values are given in Table 2.
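
As a minimal sketch of how an AUC value can be obtained from the
ranking scores of one query, the function below uses the standard
pairwise interpretation of the AUC: the fraction of (relevant,
irrelevant) pairs in which the relevant domain receives the higher
score, with ties counted as one half. The function name and the toy
scores are hypothetical and are not taken from the original
implementation; a higher score is assumed to mean greater similarity.

```python
def auc_from_scores(scores, relevant):
    """AUC for one query, computed over all (relevant, irrelevant) pairs.

    scores   -- ranking scores, one per database domain (higher = more similar)
    relevant -- booleans, True if the domain shares the query's fold label
    """
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant domain")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # relevant domain ranked above irrelevant one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only).
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_relevant = [True, False, True, False]
toy_auc = auc_from_scores(toy_scores, toy_relevant)
```

The double loop is quadratic in the database size; for a large
database a rank-based formulation would be preferable, but the
pairwise form makes the definition explicit.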

The results in Table 2 show that, with the advantage
of exploring data characteristics from various graphs,
MultiG-Rank achieves significant improvements in the
ranking outcomes; in particular, the AUC increases from
0.9478 for IR Tableau to 0.9730 for MultiG-Rank, which
uses the same Tableau feature as IR Tableau. MultiG-Rank
also outperforms QP Tableau, SHEBA, and YAKUSA, improving
the AUC from 0.9364, 0.9421 and 0.9537, respectively,
to 0.9730. Furthermore, because of
its better use of effective protein domain descriptors, IR
Tableau outperforms QP Tableau.

To evaluate the effect of using protein domain descriptors
for ranking instead of direct protein domain structure
comparisons, we compared IR Tableau with YAKUSA
and SHEBA. The main difference between them is
that IR Tableau considers both protein domain feature
extraction and comparison procedures, while YAKUSA
and SHEBA compare only pairs of protein domains
directly. The quantitative results in Table 2 show that,
even with the additional information from the protein
domain descriptor, IR Tableau does not outperform
YAKUSA.
This result strongly suggests that ranking performance
improvements are achieved mainly by graph regularization
and not by using the power of a protein domain
descriptor.

Plots of TPR versus FPR obtained using MultiG-Rank
and various field-specific protein domain ranking methods
as the ranking algorithms are shown in Figure 3(a),
and the corresponding recall-precision curves
are shown in Figure 3(b). As can be seen from the
figure, in most cases, our MultiG-Rank algorithm
significantly outperforms the other protein domain ranking
algorithms. The performance differences get larger as
the length of the returned protein domain list increases.
The YAKUSA algorithm outperforms SHEBA, IR Tableau
and QP Tableau in most cases. When only a few protein
domains are returned for the query, the numbers of both
true positive samples and false positive samples
are small, so all the algorithms
yield low FPR and TPR. As the number of returned protein
domains increases, the TPR of all of the algorithms
increases. However, MultiG-Rank tends to converge when
the FPR is more than 0.3, whereas the other ranking
algorithms seem to converge only when the FPR is more
than 0.5.
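
The dependence of TPR, FPR and precision on the length of the
returned list can be sketched as follows. This is an illustrative
helper, not the evaluation code used in the experiments: the function
name, the dictionary keys and the toy data are assumptions. Each
point corresponds to returning the top k domains for one query;
recall is the same quantity as TPR.

```python
def curve_points(scores, relevant):
    """Return TPR, FPR and precision for every list length k = 1..N.

    scores   -- ranking scores, one per database domain (higher = more similar)
    relevant -- booleans, True if the domain is a true hit for the query
    """
    n_pos = sum(1 for r in relevant if r)
    n_neg = len(relevant) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one relevant and one irrelevant domain")

    # Sort database indices by decreasing score (best hits first).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp = fp = 0
    points = []
    for k, i in enumerate(order, start=1):
        if relevant[i]:
            tp += 1
        else:
            fp += 1
        points.append({
            "k": k,
            "tpr": tp / n_pos,        # recall: fraction of true hits returned
            "fpr": fp / n_neg,        # fraction of non-hits returned
            "precision": tp / k,      # fraction of the returned list that is correct
        })
    return points


# Toy example (illustrative values only).
toy_points = curve_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```

Plotting "tpr" against "fpr" gives a ROC curve, and "precision"
against "tpr" gives a recall-precision curve for the query.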

Case Study of the TIM barrel fold 

Besides considering the results obtained for the whole
database, we also studied an important protein fold, the
TIM beta/alpha-barrel fold (c.1). The TIM barrel is a
conserved protein fold that consists of eight α-helices and
eight parallel β-strands that alternate along the peptide
backbone [22]. TIM barrels are one of the most common
protein folds. In the ASTRAL SCOP 1.75A 40% database,
there are a total of 373 proteins belonging to 33 different
superfamilies and 114 families that have TIM
beta/alpha-barrel SCOP fold type domains. In this case study, the
TIM beta/alpha-barrel domains from the query set were
used to rank all the protein domains in the database. The
ranking was evaluated both at the fold level of the SCOP
classification and at lower levels of the SCOP
classification (i.e., superfamily level and family level). To evaluate the
ranking performance, we defined "true positives" at three
levels:

Fold level When the returned database protein domain 
is from the same fold type as the query protein domain. 

Superfamily level When the returned database protein 
domain is from the same superfamily as the query protein 
domain. 



Table 2 AUC results for different protein domain ranking 
methods 



Method         AUC
MultiG-Rank    0.9730
IR Tableau     0.9478
YAKUSA         0.9537
SHEBA          0.9421
QP Tableau     0.9364

Family level When the returned database protein
domain is from the same family as the query protein
domain.
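
The three definitions can be sketched in code using SCOP concise
classification strings (sccs) of the form class.fold.superfamily.family,
such as c.1.12.7. This is a minimal sketch under that assumption; the
function name, the level table and the example identifiers other than
c.1.12.7 are illustrative and do not come from the original
implementation.

```python
# Number of leading sccs fields that must agree at each level.
# An sccs string has the form class.fold.superfamily.family, e.g. "c.1.12.7".
LEVEL_DEPTH = {"fold": 2, "superfamily": 3, "family": 4}


def is_true_positive(query_sccs, hit_sccs, level):
    """Return True if the returned domain matches the query at the given level."""
    depth = LEVEL_DEPTH[level]
    query_fields = query_sccs.split(".")
    hit_fields = hit_sccs.split(".")
    if len(query_fields) < depth or len(hit_fields) < depth:
        raise ValueError("sccs string too short for level %r" % level)
    return query_fields[:depth] == hit_fields[:depth]


# Illustrative check: a c.1.12.7 query against a hit from another
# TIM-barrel superfamily (the hit identifier is a made-up example).
same_fold = is_true_positive("c.1.12.7", "c.1.8.1", "fold")
same_superfamily = is_true_positive("c.1.12.7", "c.1.8.1", "superfamily")
```

Because each level compares a longer prefix than the one above it, a
true positive at the family level is always a true positive at the
superfamily and fold levels as well, which is why far fewer database
domains are relevant at the lower levels.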

The ROC and the recall-precision plots of the protein
domain ranking results of MultiG-Rank for the query TIM
beta/alpha-barrel domains at the three levels are given in
Figure 4. The graphs were learned using the labels at the
family, superfamily and fold levels. The results show
that the ranking performance at the fold level is better
than at the other two levels, although the performance
at the superfamily and family levels is still good.
One important factor is that when the relevance at the
lower levels was measured, far fewer protein
domains in the database were relevant to the queries,
making it more difficult to retrieve the relevant protein
domains precisely. For example, a query belonging to the
family of phosphoenolpyruvate mutase/Isocitrate
lyase-like (c.1.12.7) matched 373 database protein domains at
the fold level because its fold has 373 protein domains
in the ASTRAL SCOP 1.75A 40% database. On the other
hand, only 14 and four protein domains were relevant to
the query at the superfamily and family levels, respectively.

Conclusion 

The proposed MultiG-Rank method introduces a new
paradigm that broadens the scope of existing graph-based
ranking techniques. The main advantage of MultiG-Rank
lies in its ability to learn a unified
space of ranking scores for a protein domain database from
multiple graphs. Such flexibility is important in tackling
complicated protein domain ranking problems because it
allows more prior knowledge to be explored for effectively
analyzing a given protein domain database, including the
possibility of choosing a proper set of graphs to better
characterize diverse databases, and the ability to adopt
a multiple graph-based ranking method to appropriately
model relationships among the protein domains. Here,
MultiG-Rank has been evaluated comprehensively on a
carefully selected subset of the ASTRAL SCOP 1.75A
protein domain database. The promising experimental
results that were obtained further confirm the usefulness
of our ranking score learning approach.